We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of this approach are that it (i) allows programmers to express parallelism through an API defined in a high-level programming language, (ii) supports coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (iii) reduces the effort required to repurpose the system for different algorithms or different applications. We compare template-driven design to both full-custom and programmable approaches by studying implementations of a compute-bound data-parallel Bayesian graph inference algorithm across several candidate platforms. Specifically, we examine a range of template-based implementations on both FPGA and ASIC platforms and compare each against full-custom designs. Throughout this study, we use a general-purpose graphics processing unit (GPGPU) implementation as a performance and area baseline. We show that our approach, similar in productivity to programmable approaches such as GPGPU applications, yields implementations with performance approaching that of full-custom designs on both FPGA and ASIC platforms.